ABSTRACT: Recently, Marr and Poggio (1979) presented a theory of human stereo vision. An
implementation of that theory is presented, and consists of five steps: (1) The left and right images are each
filtered with masks of four sizes that increase with eccentricity; the shape of these masks is given by ∇²G,
the Laplacian of a gaussian function. (2) Zero-crossings in the filtered images are found along horizontal scan
lines. (3) For each mask size, matching takes place between zero-crossings of the same sign and roughly the
same orientation in the two images, for a range of disparities up to about the width of the mask's central
region. Within this disparity range, Marr and Poggio showed that false targets pose only a simple problem.
(4) The output of the wide masks can control vergence movements, thus causing small masks to come into
correspondence. In this way, the matching process gradually moves from dealing with large disparities at a
low resolution to dealing with small disparities at a high resolution. (5) When a correspondence is achieved,
it is stored in a dynamic buffer, called the 2½-dimensional sketch. To support the sufficiency of the
Marr-Poggio model of human stereo vision, the implementation was tested on a wide range of stereograms from
the human stereopsis literature. The performance of the implementation is illustrated and compared with
human perception. As well, statistical assumptions made by Marr and Poggio are supported by comparison
with statistics found in practice. Finally, the process of implementing the theory has led to the clarification and
refinement of a number of details within the theory; these are discussed in detail.
1. Introduction

If two objects are separated in depth from a viewer, then the relative positions of their images will
differ in the two eyes. This difference in relative positions, the disparity, may be measured and used to
estimate depth. The process of stereo vision, in essence, measures this disparity and uses it to compute depth
information for surfaces in the scene.

The steps involved in measuring disparity are (Marr and Poggio, 1979): (S1) a particular location on a
surface in the scene must be selected from one image; (S2) that same location must be identified in the other
image; and (S3) the disparity between the two corresponding image points must be measured. The difficulty
of the problem lies in steps (S1) and (S2), that is, in matching the images of the same location, the
so-called correspondence problem. For the case of the human stereo system, it can be shown that this matching
takes place very early in the analysis of an image, prior to any recognition of what is being viewed, using
primitive descriptors of the scene. This is illustrated by the example of random dot patterns. Julesz (1960)
demonstrated that two images, consisting of random dots when viewed monocularly, may be fused to form
patterns separated in depth when viewed stereoscopically. Random dot stereograms are particularly interesting
because when one tries to set up a correspondence between two arrays of dots, false targets occur in profusion.
A false target refers to a possible but incorrect match between elements of the two views. In spite of such
false targets, and in the absence of any monocular or high level cues, we are able to determine the correct
correspondence. Thus, the computational problem of human stereopsis reduces to that of obtaining primitive
descriptions of locations to be matched from the images, and of solving the correspondence problem for these
descriptions.

A computational theory of the stereo process for the human visual system was recently proposed by Marr
and Poggio (1979). According to this theory, the human visual processor solves the stereoscopic matching
problem by means of an algorithm that consists of five main steps: (1) The left and right images are each
filtered at different orientations with bar masks of four sizes that increase with eccentricity; these masks have
a cross-section that is approximately the difference of two gaussian functions, with space constants in the ratio
1:1.75. Such masks essentially perform the operation of a second directional derivative after low pass filtering
or smoothing, and can be used to detect changes in intensity at different scales. (2) Zero-crossings in the
filtered images are found by scanning them along lines lying perpendicular to the orientation of the mask.
Since convolving the image with the masks corresponds to performing a second directional derivative, the
zero-crossings of the convolutions correspond to extrema in the first directional derivative of the image and
thus to sharp changes in the original intensity function. (3) For each mask size, matching takes place between
zero-crossing segments of the same sign and roughly the same orientation in the two images, for a range of
disparities up to about the width of the mask's central region. Within this disparity range, Marr and Poggio
showed that false targets pose only a simple problem, because of the roughly bandpass nature of the filters.
(4) The output of the wide masks can control vergence movements, thus causing smaller masks to come into
correspondence. In this way, the matching process gradually moves from dealing with large disparities at low
resolution to dealing with small disparities at high resolution. (5) When a correspondence is achieved, it is
stored in a dynamic buffer, called the 2½-dimensional sketch (Marr and Nishihara, 1978).

An important aspect in the development of any computational theory is the design and implementation
of an explicit algorithm for that theory. There are several benefits from such an implementation. One concerns
the act of implementation itself, which forces one to make all details of the theory explicit. This often uncovers
previously overlooked difficulties, thereby guiding further refinement of the theory.

A second benefit concerns the performance of the implementation. Any proposed model of a system
must be testable. In this case, by testing on pairs of stereo images, one can examine the performance of the
implementation, and hence of the theory itself, provided, of course, that the implementation is an accurate
representation of that theory. In this manner, the performance of the implementation can be compared with
human performance. If the algorithm differs strongly from known human performance, its suitability as a
biological model is quickly brought into question (cf. the cooperative algorithm of Marr and Poggio, 1976).

This article describes an implementation of the Marr-Poggio stereo theory, written with particular
emphasis on the matching process (Grimson and Marr, 1979). For details of the derivation and justification of the
theory, see Marr and Poggio (1979).

The first part of this paper describes the overall design of the implementation. Several examples of the
implementation's performance on different images are then discussed, including random dot stereograms from
the human stereopsis literature, such as those with one image defocussed, with noise introduced into part of
the images' spectra, and so forth. It is shown that the implementation behaves in a manner similar to humans on these
special cases. Thirdly, the theory makes some statistical assumptions; these are compared with the actual
statistics found in practice. Next, some points about the theory that were clarified as a result of writing the
program are discussed. Finally, the results of running the program on some natural images are shown.


2. Design of the program 

The implementation is divided into five modules, roughly corresponding to the five steps in the summary
above. These modules, and the flow of information between them, are illustrated in Figure 1. Each of the
components is described in turn.

2.1 Input 

There are two aspects of the human stereo system, embedded in the Marr-Poggio theory, which must be
made explicit in the input to the algorithm. The first is the position of the eyes with respect to the scene, as eye
movements will be critical for obtaining fine disparity information. The second is the change in resolution of
analysis of the image with increasing eccentricity.

To account for these effects, the algorithm maintains as its initial input a stereo pair of images,
representing the entire scene visible to the viewer. This pair of images corresponds to the environment around the
visual system, rather than some integral part of the system itself. To create this representation of the scene,
natural images were digitized on an Optronix Photoscan System P1000. The sizes of these images are indicated
in the legends. Grey-level resolution is 8 bits, providing 256 intensity levels. For the random dot patterns
illustrated in this article, the images were constructed by computer, rather than digitized from a photograph.

Figure 1. Diagram of the algorithm. The images of the scene are mapped into the images of the retinas, taking
into account the eye positions. Each image is convolved with a set of different sized masks and zero-crossings
are located for each convolution. For each size mask, the left and right zero-crossing descriptions are matched.
These matched descriptions are combined into a single representation. As well, the matches from the larger
channels can drive eye vergence movements, causing new retinal images to be created and allowing the smaller
channels to come into correspondence.

For a given position of the eyes, relative to the scene, a representation of the images on the two retinas is
extracted. The algorithm creates this retinal representation by obtaining a second, smaller pair of images from
the images representing the whole scene. The mapping from the scene images into the retinal images accounts
for the two factors inherent in the Marr-Poggio theory. First, different sections of the scenes will be mapped to
the center (fovea) of the retinal images as the positions of the eyes are varied. Since the matching process will
take place on the array representing the retinal images, it is important that the coordinate systems of those
arrays coincide with the current positions of the eyes. Note that the portion of the scene image which is mapped
into the retinal image may differ for the two eyes, depending on the relative positions of the two optical axes.
In particular, there may be differences in vertical alignment as well as in horizontal alignment. Second, the
Marr-Poggio theory also states that the resolution of the earlier stages of the algorithm (the convolution and
zero-crossings) scales linearly with eccentricity. The most convenient method for dealing with this fact is to
account for the scaling with eccentricity at the level of the extraction of the images. This means that rather
than extracting a set of retinal images in a linear manner, we may map the scene into the retinal images by
a mapping whose magnification varies with eccentricity. By so doing, the later stages of processing need not
explicitly account for the variation with eccentricity. Rather, these processes are considered as operating on a
uniform grid. Note that this eccentric mapping is not essential, especially for small images. In most of the cases
illustrated in this article, the mapping was not used.

After the completion of this stage, the implementation has created a representation of the images that
has accounted for eye position and for retinal scaling with eccentricity. For each pass of the algorithm,
the matching will take place on the representation of the retinal images, thereby implicitly assuming some
particular eye positions. Once the matching has been completed, the disparity values obtained may be used to
change the positions of the two optic axes, thus causing a new pair of retinal images to be extracted from the
representations of the scene, and the matching process may proceed again.

2.2 Convolution 

Given the retinal representations of the images, it is then necessary to transform them into a form upon
which the matcher may operate. Marr and Poggio (1979) argued that the items to be matched in an image
must be in one-to-one correspondence with well-defined locations on a physical surface. This led to the use of
image predicates which correspond to changes in intensity. Since these intensity changes can occur over a wide
range of scales within a natural image, they are detected separately at different scales. This is in agreement with
the findings of Campbell and Robson (1968), who showed that visual information is processed in parallel by
a number of independent spatial-frequency-tuned channels, and with the findings of Julesz and Miller (1975)
and Mayhew and Frisby (1976), who showed that spatial-frequency-tuned channels are used in stereopsis and
are independent. Recent work by Wilson and Bergen (1979) and Wilson and Giese (1977) provided evidence
for the particular form of these spatial-frequency-tuned operators. Measuring contrast sensitivity to vertical
line stimuli, Wilson and his collaborators showed that the image is convolved with an operator which in one
dimension may be closely approximated by a difference of two gaussian functions (DOG).

In the original theory (Marr and Poggio, 1979), the proposed masks were oriented bar masks whose
cross-section was a difference of two gaussians, as given by the Wilson and Bergen data. If an intensity change
occurs along a particular orientation in the image, there will be a peak in the first directional derivative of
intensity, and a zero-crossing in the second directional derivative. Thus, the intensity changes in the image
can be located by finding zero-crossings in the output of a second directional derivative operator. However,
a number of practical considerations have led Marr and Hildreth (1979) to suggest that the initial operators
not be directional operators. The only non-directional linear second derivative operator is the Laplacian. Marr
and Hildreth have shown that provided two simple conditions on the intensity function in the neighbourhood
of an edge are satisfied, the zero-crossings of the second directional derivative taken perpendicular to an edge
will coincide with the zero-crossings of the Laplacian along that edge. Therefore, theoretically, we can detect
intensity changes occurring at all orientations using the single non-oriented Laplacian operator. Thus, Marr and
Hildreth propose that intensity changes occurring at a particular scale may be detected by locating the
zero-crossings in the output of ∇²G, the Laplacian of a gaussian distribution. The operator, together with its Fourier
transform, is illustrated in Figure 2. The form of the operator is given by:


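$$\nabla^2 G(r, \theta) = \left[\frac{r^2 - 2\sigma^2}{\sigma^4}\right] \exp\left(\frac{-r^2}{2\sigma^2}\right)$$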




Given the form of the operators, it is only left to determine the size of these masks. To do this, we
first note that Marr and Hildreth (1979) showed that the operator ∇²G is a close approximation to the DOG
function. Wilson and Bergen's data indicated DOG filters whose sizes, specified by the width w of the filter's
central excitatory region, range from 3.1' to 21' of visual arc. The variable w is related to the constant σ of
∇²G by the relation:

$$\sigma = \frac{w}{2\sqrt{2}}.$$

Wilson and Bergen's values were obtained by using oriented line stimuli. To obtain the diameter of the
corresponding circularly symmetric center-surround receptive field, the values of w must be multiplied by
√2. Finally, we want the resolution of the initial images to roughly represent the resolution of processing by
the cones, and the size of the filters to represent the size of the retinal operators. In the most densely packed
region of the human fovea, the center-to-center spacing of the cones is 2.0 to 2.3 μm, corresponding to an
angular spacing of 25 to 29 arc seconds (O'Brien, 1951). Accounting for the conversion of Wilson and Bergen's
data, and using the figure of 27 seconds of arc for the separation of cones in the fovea, one arrives at values of
w in the range 9 to 63 image elements, and hence, values of σ in the range 3 to 23 image elements.
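
The unit conversion described above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes that one image element corresponds to one cone spacing of 27 arc seconds. The values it produces land close to the rounded ranges quoted in the text.

```python
import math

CONE_SPACING_ARCSEC = 27.0  # cone separation used in the text


def width_in_image_elements(w_arcmin, cone_spacing_arcsec=CONE_SPACING_ARCSEC):
    """Convert a line-stimulus width (arc minutes) into image elements.

    The width is first multiplied by sqrt(2) to get the diameter of the
    circularly symmetric field, then divided by the cone spacing.
    """
    diameter_arcmin = w_arcmin * math.sqrt(2)
    return diameter_arcmin * 60.0 / cone_spacing_arcsec


def sigma_from_width(w):
    """Space constant from the central width: sigma = w / (2 * sqrt(2))."""
    return w / (2.0 * math.sqrt(2))


for w_arcmin in (3.1, 21.0):
    w = width_in_image_elements(w_arcmin)
    print(f"{w_arcmin}' of arc: w = {w:.1f}, sigma = {sigma_from_width(w):.1f}")
```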

Recently, it has been proposed (Marr, Poggio and Hildreth, 1979) that a further, smaller channel may be 
present. This channel would have a central excitatory width of w = 1.5’, roughly corresponding to 4 image 
elements. 
Figure 2. The operators G'' and ∇²G. The top left figure shows G'', the second derivative of a one-dimensional
gaussian distribution. The top right figure shows ∇²G, its rotationally symmetric two-dimensional counterpart.
The bottom figures show their Fourier transforms.

The present implementation uses four filters, each of which is a radially symmetric difference of
gaussians, with w values of 4, 9, 17 and 35 image elements. The coefficients of the filters were represented to
a precision of 1 part in 2048. Coefficients of less than 1/2048th of the maximum value of the mask were set to
zero. Thus, the truncation radius of the mask (the point at which all further mask values were treated as zero)
was approximately 1.8w, or equivalently about 5.1σ given σ = w/2√2.

The actual convolutions were performed on a LISP machine constructed at the MIT Artificial Intelligence
Laboratory, using additional hardware specially designed for the purpose (Knight et al., 1979). Figures 3 and 4
illustrate some images and their convolutions with various sized masks.

After the completion of this stage of the algorithm, one has four filtered copies of each of the images, each
copy having been convolved with a different size mask.
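
As a rough illustration of how such a mask could be tabulated, the sketch below samples a ∇²G profile on a square grid and zeroes small coefficients. It is not the implementation described above, which used difference-of-gaussian filters on special hardware: the ∇²G form is substituted here because the text notes it closely approximates the DOG, the profile is taken as an unnormalized Laplacian of a gaussian (so the centre is negative), and the function name `log_mask` and its parameters are hypothetical.

```python
import math


def log_mask(w, cutoff=1.0 / 2048):
    """Tabulate an (assumed) unnormalized Laplacian-of-gaussian mask.

    w is the width of the central region in image elements, so that
    sigma = w / (2 * sqrt(2)). The grid extends to about 1.8 * w, and
    coefficients smaller than `cutoff` times the peak magnitude are zeroed.
    """
    sigma = w / (2.0 * math.sqrt(2))
    radius = int(math.ceil(1.8 * w))

    def value(x, y):
        r2 = x * x + y * y
        return (r2 - 2 * sigma ** 2) / sigma ** 4 * math.exp(-r2 / (2 * sigma ** 2))

    mask = [[value(x, y) for x in range(-radius, radius + 1)]
            for y in range(-radius, radius + 1)]
    peak = max(abs(v) for row in mask for v in row)
    return [[v if abs(v) >= cutoff * peak else 0.0 for v in row] for row in mask]


mask = log_mask(4)
centre = len(mask) // 2
print(len(mask), mask[centre][centre], mask[centre][centre + 3])
```

With this sign convention the profile changes sign at a distance of w/2 from the centre, which is what makes w the width of the central region.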

2.3 Detection and description of zero-crossings

According to the Marr-Poggio theory, the elements that are matched between images are (i)
zero-crossings whose orientations are not horizontal, and (ii) terminations. The exact definition and hence the
detection of terminations is at present uncertain; as a consequence, only zero-crossings are used as input to the
matcher.

Since, for the purpose of obtaining disparity information, we may ignore horizontally oriented segments,
the detection of zero-crossings can be accomplished by scanning the convolved image horizontally for adjacent
elements of opposite sign, or for three horizontally adjacent elements, the middle one of which is zero, the
other two containing convolution values of opposite sign. This gives the position of zero-crossings to within an
image element.
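
The horizontal scan just described can be sketched as follows. This is a minimal illustration rather than the original code; the function name is hypothetical, and two conventions are assumptions: a crossing between two adjacent elements is reported at the index of the left element, and a sign of +1 means the values change from negative to positive when moving left to right.

```python
def zero_crossings_in_row(row):
    """Scan one row of convolution values for zero-crossings.

    Returns a list of (x, sign) pairs. Two cases are detected:
    adjacent elements of opposite sign, and three adjacent elements
    whose middle value is zero and whose outer values differ in sign.
    """
    found = []
    n = len(row)
    for x in range(n - 1):
        a, b = row[x], row[x + 1]
        if a * b < 0:
            # Adjacent elements of opposite sign.
            found.append((x, 1 if b > 0 else -1))
        elif b == 0 and x + 2 < n and a * row[x + 2] < 0:
            # Zero in the middle, opposite signs on either side.
            found.append((x + 1, 1 if row[x + 2] > 0 else -1))
    return found


print(zero_crossings_in_row([3, 1, -2, -1, 0, 4, 2]))
```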

In addition to their location, we record the sign of the zero-crossings (whether convolution values change
from positive to negative or negative to positive as we move from left to right) and a rough estimate of the
local, two-dimensional orientation of pieces of the zero-crossing contour. In the present implementation, the
orientation at a point on a zero-crossing segment is computed as the direction of the gradient of the
convolution values across that segment, and recorded in increments of 30 degrees. Figures 5 and 6 illustrate
zero-crossings obtained in this way from the convolutions of Figures 3 and 4. Positive zero-crossings are shown
white, and negative crossings, black.

Figure 3. Examples of convolutions with ∇²G. The top figure shows a natural image. The bottom figures show
the convolution of this image with a set of ∇²G operators. The sizes of these operators are w = 36, 18, 9 and
4 image elements.

Figure 4. Examples of convolutions with ∇²G. The top figure shows a random dot pattern. The bottom
figures show the convolution of this image with a set of ∇²G operators. The sizes of these operators are
w = 36, 18, 9 and 4 image elements.

We compute this zero-crossing description for each image and for each size of mask. 
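
One way to obtain an orientation estimate of the kind described in this section is sketched below. It is an assumption-laden illustration: the gradient is approximated by central differences (the text does not say how the gradient was computed), the array is indexed as `conv[y][x]`, and the function name is hypothetical.

```python
import math


def quantized_orientation(conv, x, y, step_deg=30):
    """Direction of the gradient of the convolution values at (x, y),
    rounded to the nearest multiple of step_deg, in degrees [0, 360).

    Assumes (x, y) is not on the border of the array.
    """
    gx = (conv[y][x + 1] - conv[y][x - 1]) / 2.0
    gy = (conv[y + 1][x] - conv[y - 1][x]) / 2.0
    angle = math.degrees(math.atan2(gy, gx)) % 360.0
    # floor(v + 0.5) avoids Python's round-half-to-even behaviour.
    steps = int(math.floor(angle / step_deg + 0.5))
    return (steps * step_deg) % 360


# Values increasing to the right give a gradient direction of 0 degrees.
ramp = [[float(x) for x in range(3)] for _ in range(3)]
print(quantized_orientation(ramp, 1, 1))
```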

2.4 Matching 

The matcher implements the second of the matching algorithms described by Marr and Poggio (1979,
p. 315). For each size of filter, matching consists of 6 steps:

(1) Fix the eye positions. 

(2) Locate a zero-crossing in one image. 

(3) Divide the region about the corresponding point in the second image into three pools. 

(4) Assign a match to the zero-crossing based on the potential matches within the pools. 

(5) Disambiguate any ambiguous matches. 

(6) Assign the disparity values to a buffer. 

These steps may be repeated several times during the fusion of an image. Given a position for the optic 
axes, these matching steps are performed, with the results stored in a buffer. These results may be used to 
refine the eye positions, causing a new set of retinal images to be extracted from the scene, and the matching 
steps are performed again. 

We now expand upon each of the six steps of the matching process. The first step consists of fixing the
two eye positions. The alignment between the two zero-crossing descriptions, corresponding to the positions
of the optical axes, is determined in two ways. The initial offsets of the descriptions are arbitrarily set to zero.
Thereafter, the offsets of the two optical axes are determined by accessing the current disparity values for
a region and using these values to adjust the vergence of the eyes. In this implementation, this is done by
modifying the extraction of the retinal images from the images of the entire scene, accounting for the positions
of the optical axes.

Figure 5. Examples of zero-crossing descriptions. The top figure shows a natural image. The bottom figures
show the zero-crossings obtained from the convolutions of Figure 3. The white lines mark positive
zero-crossings and the black lines, negative ones.

Figure 6. Examples of zero-crossing descriptions. The top figure shows a random dot pattern. The bottom
figures show the zero-crossings obtained from the convolutions of Figure 4. The white lines mark positive
zero-crossings and the black lines, negative ones.

Once the eye positions have been fixed, and the retinal images extracted, the images are convolved with
the DOG filters, and the zero-crossing descriptions are extracted from the convolved images. For a
zero-crossing description corresponding to a particular mask size, the matching is performed by locating a
zero-crossing and executing the following operation. Given the location of a zero-crossing in one image, a
horizontal region about the same location in the other image is partitioned into three pools. These pools form the
region to be searched for a possible matching zero-crossing and consist of two larger convergent and divergent
regions, and a smaller one lying centrally between them. Together these pools span a disparity range equal to
2w, where w is the width of the central excitatory region of the corresponding two-dimensional convolution
mask.

The following criteria are used for matching zero-crossings in the left and right filtered images, for each
pool:

(1) the zero-crossings must come from convolutions with the same size mask.

(2) the zero-crossings must have the same sign.

(3) the zero-crossing segments must have roughly the same orientation.

A match is assigned on the basis of the number of pools containing a matching zero-crossing. If exactly
one zero-crossing of the appropriate sign and orientation (within 30 degrees) is found within a pool, the
location of that crossing is transmitted to the matcher. If two candidate zero-crossings are found within one
pool (an unlikely event), the matcher is notified and no attempt is made to assign a match for the point in
question. If the matcher finds a single crossing in only one of the three pools, that match is accepted, and the
disparity associated with the match is recorded in a buffer. If two or three of the pools contain a candidate
match, the algorithm records that information for future disambiguation.
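
The pool logic above can be sketched for a single zero-crossing as follows. Several details are assumptions made only for illustration: the half-width of the central pool (`central_half_width`) is not specified in the text, positive disparity is arbitrarily labelled divergent, candidates are given as `(x, sign, orientation)` tuples from one mask size, and all names are hypothetical.

```python
def pool_of(d, w, central_half_width):
    """Name of the pool containing disparity d, or None if |d| > w."""
    if abs(d) > w:
        return None
    if abs(d) <= central_half_width:
        return "zero"
    return "divergent" if d > 0 else "convergent"  # sign convention assumed


def orientation_close(a, b, tol=30):
    """True if two orientations (degrees) differ by at most tol."""
    diff = abs(a - b) % 360
    return min(diff, 360 - diff) <= tol


def match_zero_crossing(x0, sign, orient, candidates, w, central_half_width=1):
    """Classify the match for a zero-crossing at x0 in one image against
    candidate zero-crossings on the same row of the other image."""
    pools = {"convergent": [], "zero": [], "divergent": []}
    for x, s, o in candidates:
        pool = pool_of(x - x0, w, central_half_width)
        if pool is not None and s == sign and orientation_close(o, orient):
            pools[pool].append(x - x0)
    if any(len(ds) > 1 for ds in pools.values()):
        return "multiple_in_pool", {}      # no match is attempted
    hits = {p: ds[0] for p, ds in pools.items() if ds}
    if not hits:
        return "unmatched", {}
    if len(hits) == 1:
        return "unique", hits              # accepted, disparity recorded
    return "ambiguous", hits               # kept for later disambiguation


print(match_zero_crossing(10, 1, 90, [(12, 1, 90), (7, 1, 60)], w=4))
```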

Once all possible unambiguous matches have been identified, an attempt is made to disambiguate double
or triple matches. This is done by scanning a neighbourhood about the point in question, and recording the
disparity sign of the unambiguous matches within that neighbourhood. (Disparity sign refers to the sign of
the pool from which the match comes: divergent, convergent or zero.) If the ambiguous point has a potential
match of the same disparity sign as the dominant type within the neighbourhood, then that is chosen as the
match (this is the "pulling" effect). Otherwise, the match at that point is left ambiguous.
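
A minimal sketch of this disambiguation step is given below. The data layout (a dictionary from pool name to disparity for the ambiguous point, and a list of pool names for the unambiguous neighbours) and the function name are hypothetical, and the treatment of ties, where no single disparity sign dominates, is an assumption: the point is simply left ambiguous.

```python
from collections import Counter


def disambiguate(ambiguous_hits, neighbour_signs):
    """Resolve an ambiguous match using the neighbourhood.

    ambiguous_hits: dict mapping pool name -> candidate disparity.
    neighbour_signs: pool names of unambiguous matches nearby.
    Returns (pool, disparity), or None if the point stays ambiguous.
    """
    if not neighbour_signs:
        return None
    counts = Counter(neighbour_signs).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no dominant disparity sign (assumed behaviour)
    dominant = counts[0][0]
    if dominant in ambiguous_hits:
        return dominant, ambiguous_hits[dominant]
    return None


print(disambiguate({"convergent": -3, "divergent": 2},
                   ["divergent", "divergent", "zero"]))
```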

There is the possibility that the region under consideration does not lie within the ±w disparity range
handled by the matcher. This situation is detected and handled by the following operation. Consider the case
in which the region does lie within the disparity range ±w. Excluding the case of occluded points, every
zero-crossing in the region will have at least one candidate match (the correct one) in the other filtered image. On
the other hand, if the region lies beyond the disparity range ±w, then the probability of a given zero-crossing
having at least one candidate match will be less than 1. In fact, Marr and Poggio show that the probability of
a zero-crossing having at least one candidate match in this case is roughly 0.7. We can perform the following
operation in this case. For a given eye position, the matching algorithm is run for all the zero-crossings. Any
crossing for which there is no match is marked as such. If the percentage of matched points in any region is
less than a threshold of 0.7 then the region is declared to be out of range, and no disparity values are accepted
for that region.
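
The out-of-range test reduces to a comparison of a matched fraction against the 0.7 threshold. The sketch below assumes one boolean per zero-crossing in the region, and treats a region with no zero-crossings as out of range, which the text does not specify; the function name is hypothetical.

```python
def region_in_range(match_flags, threshold=0.7):
    """Return True if the fraction of matched zero-crossings in a region
    reaches the threshold, False if the region is declared out of range.

    match_flags: iterable of booleans, one per zero-crossing.
    """
    flags = list(match_flags)
    if not flags:
        return False  # assumption: nothing to match means no disparities
    return sum(flags) / len(flags) >= threshold


print(region_in_range([True] * 9 + [False]))
print(region_in_range([True] * 6 + [False] * 4))
```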

The overall effect of the matching process, as driven from the left image, is to assign disparity values to
most of the zero-crossings obtained from the left image. An example of the output appears in Figure 7. In
this array, a zero-crossing at position (x, y) with associated disparity d has been placed in a three-dimensional
array with coordinate (x, y, d). For display purposes, the array is shown in the figures as viewed from a point
some distance away. The heights in the figure correspond to the assigned disparities.

After completion of this stage of the implementation, we have obtained a disparity array for each mask
size. The disparity values are located only along the zero-crossing contours obtained from that mask.


2.5 Vergence Control

The Marr-Poggio theory states that in order to obtain fine resolution disparity information, it is necessary
that the smallest channels obtain a matching. Since the range of disparity over which a channel can obtain
a match is directly proportional to the size of the channel, this means that the positions of the eyes must
be assigned appropriately to ensure that the corresponding zero-crossing descriptions from the two images
are within a matchable range. The disparity information required to bring the smallest channels into their
matchable range is provided by the larger channels. That is, if a region of the image is declared to be out of
range of fusion by the smaller channels, one can frequently obtain a rough disparity value for that region from
the larger channels, and use this to verge the eyes. In this way, the smaller channels can be brought into a
range of correspondence.

Figure 7. Results of the algorithm. The top stereo pair is an image of a painted coffee jar. The next two figures
show two orthographic views of the disparity map. The disparities are displayed as {x, y, c - a d(x, y)},
where c is a constant and d(x, y) is the difference in the location of a zero-crossing in the right and left images.
For purposes of illustration, a has been adjusted to enhance the features of the disparity map. The left view
of the disparity map shows the jar as viewed from the lower edge of the image, and the right view shows the
jar as viewed from the left edge of the image. Note that the background plane appears tilted in the disparity
map. This agrees with the fused perception. The second stereo pair is a 50% density random dot pattern.
The bottom figure shows the disparity map as viewed orthographically from some distance away. All disparity
maps are those obtained from the w = 4 channel.

Thus, after the disparities from the different channels have been combined, there is a mechanism for
controlling vergence movements of the eyes. This operates by searching for regions of the image which do
not have disparity values for the smallest channel, but which do have disparity values for the larger channels.
These large channel values are used to provide a refinement to the current eye positions, thereby bringing the
smaller channels into range of correspondence. Two possible mechanisms for extracting the disparity value
from a region of the image include using the peak value of a histogram of the disparities in that
neighbourhood, or using a local average of the disparity values. In the current implementation, the search for such a
region proceeds outwards from the fovea.
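
The two ways of extracting a single disparity value from a region, mentioned above, can be sketched as follows. The representation is an assumption made for illustration: each channel's disparities for the region are a list in which `None` marks a zero-crossing without a match, and the function names are hypothetical.

```python
from collections import Counter


def histogram_peak(disparities):
    """Most frequent disparity value in the region."""
    return Counter(disparities).most_common(1)[0][0]


def local_average(disparities):
    """Mean disparity value in the region."""
    return sum(disparities) / len(disparities)


def vergence_shift(fine, coarse, method=histogram_peak):
    """Disparity used to re-verge the eyes for one region, or None.

    A shift is proposed only when the smallest channel (fine) has no
    disparity values in the region but a larger channel (coarse) does.
    """
    if any(d is not None for d in fine):
        return None
    values = [d for d in coarse if d is not None]
    if not values:
        return None
    return method(values)


print(vergence_shift([None, None], [5, 5, 6, None]))
print(vergence_shift([None, None], [5, 5, 6, None], method=local_average))
```

The histogram peak is less sensitive to a few outlying matches than the average, while the average can return values between the sampled disparities; the text does not state a preference between the two.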


It should be noted here that although the use of disparity information from coarser channels to drive
eye movements, allowing smaller channels to come into correspondence, is a necessary condition of the
Marr-Poggio theory, it is not necessarily the only such condition. In other words, there may be other modules of
the visual system which can initiate eye movements, and thereby affect the input to the matching component,
by altering the retinal images presented to the matcher. An example of this would be the evidence of Kidd
et al. (1979) concerning the ability of texture contours to facilitate stereopsis by initiating eye movements.
However, such effects are somewhat orthogonal to the question of the sufficiency of the matching component
of the Marr-Poggio theory, since they affect the input to the matcher, but not the actual performance of the
matching algorithm itself.

2.6 The 2½-Dimensional Sketch

Once the separate channels have performed their matching, the results are combined and stored in a
buffer, called the 2½-D sketch. There are several possible methods for accomplishing this. As far as the
Marr-Poggio theory is concerned, the important point is that some type of storage of disparity information occurs.
(Perhaps the strongest argument for this is the fact that up to 2 degrees of disparity can be held fused in the
fovea.)

We shall outline two different possibilities for the combination of the different channels. The method
currently used in the implementation will be described below. A more biologically feasible method will be
outlined in the discussion.

One of the critical questions concerning the form of the 2½-D sketch is whether it reflects the scene or the
retinal images. For all the cases illustrated in this article, the sketch was constructed by directly relating the
coordinates of the sketch to the coordinates of the images of the entire scene. That is, as disparity information
was obtained, it was stored in a buffer at the position corresponding to the position in the original scene from
which the underlying zero-crossing came. Since disparity information about the scene is extracted from several
eye positions, in order to store this information into a buffer, explicit information about the positions of the
eyes is required. It will be argued in the discussion that this is probably inappropriate as a model of the
human system. However, for the purposes of demonstrating the effectiveness of the matching module, such a
representation is sufficient.

The actual mechanism for storing the disparity values requires some combination of the disparity maps
obtained for each of the channels. Currently, the sketch is updated, for each region of the image, by writing
in the disparity values from the smallest channel which is within range of fusion. Vergence movements are
possible in order to bring smaller channels into a range of matching for some region. Further, for those regions
of the image for which none of the channels can find matches, modification of the eye positions over a scale
larger than that of the vergence movements is possible. By this method, one can attempt to bring those regions
of the image into a range of fusion.

There are several possibilities for the actual method of driving the vergence movements. Two of these
were outlined in the previous section.
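
The update rule described above (write in the values from the smallest channel that is within range of
fusion) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not
the code of the implementation: the function name update_sketch, the region identifiers, and the convention
that a channel reports None for a region it could not fuse are all hypothetical. The mask widths 35, 17, 9
and 4 are used purely as channel labels.

```python
from typing import Dict, Optional, Tuple

# Hypothetical types: per-point disparities for one region of one channel.
Points = Dict[Tuple[int, int], float]
# For each region: mask width -> points, or None if that channel is out of range.
RegionResults = Dict[str, Dict[int, Optional[Points]]]


def update_sketch(sketch: Dict[str, Tuple[int, Points]],
                  results: RegionResults) -> Dict[str, Tuple[int, Points]]:
    """Store, for each region, the disparities of the smallest channel in range."""
    for region, by_width in results.items():
        in_range = [w for w, pts in by_width.items() if pts is not None]
        if not in_range:
            # No channel fused here; a larger eye movement would be needed.
            continue
        smallest = min(in_range)
        sketch[region] = (smallest, by_width[smallest])
    return sketch


results = {
    "A": {35: {(0, 0): 12.0}, 9: None},             # only the large channel fused
    "B": {35: {(5, 5): 3.0}, 9: {(5, 6): 2.0}},     # both fused: small channel wins
    "C": {35: None, 9: None},                       # nothing fused
}
sketch = update_sketch({}, results)
```

Regions with no channel in range are left untouched, which corresponds to the case where an eye movement on
a larger scale would be attempted.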

The final output of the algorithm consists of a representation of disparity values in the image, those
disparities being restricted to positions in the image lying along zero-crossing segments.

2.7 Summary of the process 

The complete algorithm, as currently implemented, uses four mask sizes. Initially, the two views of the
scene are mapped into a pair of retinal images. These images are convolved with each mask. The
zero-crossings and their orientation are computed, for each channel and each view. The initial alignments of
the eyes determine the registration of the images. The matching of the descriptions from each channel is
performed for this alignment. Any points with either ambiguous matchings or with no match are marked as
such.
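
The zero-crossing step for a single horizontal scan line can be illustrated as follows. The sketch assumes
that the row has already been convolved with one of the masks. The function name and the (position, sign)
output format are assumptions made for illustration only, and orientation, which needs the neighbouring
rows, is not computed here.

```python
from typing import List, Tuple


def zero_crossings(row: List[float]) -> List[Tuple[int, int]]:
    """Return (position, sign) for each sign change along one scan line.

    position is the index of the last non-zero sample before the crossing;
    sign is +1 for a negative-to-positive change and -1 for the reverse.
    Samples that are exactly zero are skipped, so [-1, 0, 2] yields a
    single crossing.
    """
    crossings = []
    prev_index, prev_value = None, 0.0
    for i, value in enumerate(row):
        if value == 0:
            continue
        if prev_index is not None and (value > 0) != (prev_value > 0):
            crossings.append((prev_index, 1 if value > 0 else -1))
        prev_index, prev_value = i, value
    return crossings


filtered_row = [-2.0, -1.0, 3.0, 4.0, -1.0, 0.0, 2.0]
found = zero_crossings(filtered_row)
```

Keeping the sign with each crossing matters because matching is only attempted between zero-crossings of
the same sign.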

Next, the percentage of unmatched points is checked, for all square neighbourhoods of a particular size.
This size is chosen so as to ensure that the measurement of the statistics of matching within that
neighbourhood is statistically sound. Only the disparity points of those regions whose percentage of unmatched
points is below a certain threshold, determined by the statistical analysis of Marr and Poggio (1979), are
allowed to remain. All other points are removed. The values which are kept are stored into a buffer. At this stage,
vergence movements may take place, using information from the larger channels to bring the smaller channels
into a range where matching is possible. Further, if there are regions of the image which do not have disparity
values at any level of channel, an eye movement may take place in an attempt to bring those portions of the
image into a range where at least the largest mask can perform its matching.

Note that the matching process takes place independently for each of the four channels. Once the
matching of each channel is complete, the results are combined into a single representation of the disparities.

The final output is thus a disparity map, with disparities assigned along most portions of the zero-crossing
contours obtained from the smallest masks. The accuracy of the disparities thus obtained depends on how
accurately the zero-crossings have been localized, which may, of course, be to a resolution much finer than the
initial array of intensity values that constitutes the image.


3. Examples and Assessment of Performance 

A standard tool in the examination of human stereo perception is the random dot stereogram (Julesz,
1960, 1971). This is a pair of stereo images where each image, when viewed monocularly, consists only of
randomly distributed dots, yet when viewed stereoscopically, may be fused to yield patterns separated in
depth. Such patterns are a useful tool for analysing the stereo component of the human visual system, since
there are no visual cues other than the stereoscopic ones. We can test the sufficiency of the algorithm by
comparing human perception with the performance of the algorithm on such patterns. As well, since random
dot stereograms have well-demarcated disparity values, it is easy to assess the correctness of the algorithm's
performance on such patterns.
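
For readers who wish to experiment, a random dot stereogram of this kind can be generated with a short
program. The following is a minimal sketch and not the procedure used to produce the figures: the function
name, the side length of the central square (not stated in the text) and the choice to fill the uncovered
strip with fresh random dots are all assumptions. The default image size, dot size, shift and density follow
the first test pattern described in this section.

```python
import random
from typing import List, Tuple

Image = List[List[int]]


def random_dot_stereogram(size: int = 320, dot: int = 4, square: int = 160,
                          shift: int = 12, density: float = 0.5,
                          seed: int = 0) -> Tuple[Image, Image]:
    """Return (left, right) images whose central square is displaced by `shift`."""
    rng = random.Random(seed)

    def random_image() -> Image:
        cells = (size + dot - 1) // dot
        # One random value per dot, then each dot is expanded to dot x dot elements.
        coarse = [[1 if rng.random() < density else 0 for _ in range(cells)]
                  for _ in range(cells)]
        return [[coarse[r // dot][c // dot] for c in range(size)]
                for r in range(size)]

    left = random_image()
    filler = random_image()              # fresh dots for the strip the shift uncovers
    right = [row[:] for row in left]
    lo = (size - square) // 2
    hi = lo + square
    for r in range(lo, hi):
        for c in range(lo, hi):
            right[r][c - shift] = left[r][c]     # central square moved leftwards
        for c in range(hi - shift, hi):
            right[r][c] = filler[r][c]           # uncovered strip gets new dots
    return left, right


left, right = random_dot_stereogram(size=64, dot=4, square=32, shift=8, seed=1)
```

Outside the rows of the central square the two images are identical, so each image viewed alone gives no
hint of the hidden figure.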

Table 1 lists some of the matching statistics for various random dot patterns. These are illustrated in 
Figures 8-13 and discussed below. 

Table 1.

The first pattern consisted of a central square separated in depth from a second plane. The pattern had
a dot density of 50% and its analysis is shown in Figure 7. Each dot was a square with four image elements
on a side. For the algorithm, this corresponds to a dot of approximately two minutes of visual arc. The total
pattern was 320 image elements on a side. The central plane of the figure was shifted 12 image elements in
one image relative to the other. The final disparity map assigned after the matching of the smallest channel
had the following statistics. The number of zero-crossing points in the left description which were assigned
a disparity was 11847. Of these 11847, 11830 were disparity values which were exactly correct, and an
additional 14 deviated by one image element from the correct value. Approximately 0.03% of the matched
points, or roughly 3 points in 10000, were incorrectly matched.
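
As a quick check of the figures quoted for this first pattern, the counts can be combined directly. The
variable names are illustrative only.

```python
matched = 11847      # zero-crossings assigned a disparity
exact = 11830        # disparities that were exactly correct
off_by_one = 14      # disparities within one image element of the correct value

incorrect = matched - exact - off_by_one
error_rate = incorrect / matched

print(incorrect)                       # 3
print(f"{100 * error_rate:.3f}%")      # 0.025%
```

Three mismatches out of 11847 is about 2.5 points in 10000, in line with the rough figure of 3 points in
10000.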

A similar test was run on patterns with a dot density of 25%, 10% and 5%. The results are illustrated in
Figure 8.

For each of these cases, the number of incorrectly matched points was extremely low. Those points
which were assigned incorrect disparities all occurred at the border between the two planes, that is, along the
discontinuity in disparity.

A more complex random dot pattern consisted of a wedding cake, built from four different planar layers,
each separated by 8 image elements, or 2 dot widths. This is illustrated in Figure 9.

In this case, the number of zero-crossing points assigned a disparity was 11 IC2. Of those points, 11055
were assigned a disparity value which was exactly correct, and an additional [illegible] deviated from the
correct value by one image element. Approximately 0.08% of the points were incorrectly matched. Again, these
incorrect points all occurred at the boundaries between the planes. A second complex pattern is illustrated in
Figure 9. The object is a spiral with a range of continuously varying disparities.

Figure 8. The top stereo pair is a 25% density random dot pattern. The disparity map below it is displayed as
in Figure 7. The bottom stereo pair is a 5% density random dot pattern. Its disparity map is shown below it.
Both disparity maps are obtained from the w = 4 channel.

Figure 9. The top stereo pair is a 50% density wedding cake, composed of four planar levels. The disparity
map is shown below it. The bottom stereo pair is a 50% spiral. The disparity map is shown below it, in a
manner similar to Figure 7. Both disparity maps are obtained from the w = 4 channel.

There are a number of special cases of random dot patterns which have been used to test various aspects
of the human visual system. The algorithm was also tested on several of these stereograms. They are outlined
below, and the performance of the algorithm is compared with human perception.

It is known that if one or both of the images of a random dot stereogram are blurred, fusion of the
stereogram is still possible (Julesz 1971, p.96). To test the algorithm in this case, the left half of a 50% density
pattern was blurred by convolution with a gaussian mask. This is illustrated in Figure 10. The disparity
values obtained in this case were not as exact as in the case of no blurring. Rather, there was a distribution
of disparities about the known correct values. As a result, the percentage of points that might be considered
incorrect (more than one image element deviation from the correct value) rose to 6%. However, the qualitative
performance of the algorithm is still that of two planes separated in depth. It is interesting to note that the slight
distribution of disparity values about those corresponding to the original planes is consistent with the human
perception of a pair of slightly warped planes.

Julesz and Miller (1975) showed that fusion is also possible in the presence of some types of masking
noise. In particular, if the spectrum of the noise is disjoint from the spectrum of the pattern, it can be
demonstrated that fusion of the pattern is still possible. Within the framework of the Marr-Poggio theory, this
is equivalent to stating that if one introduces noise of such a spectrum as to interfere with one of the stereo
channels, fusion is still possible among the other channels, provided the noise does not have a substantial
spectral component overlapping other channels as well. This was tested on the algorithm by high pass filtering
a second random dot pattern, to create the noise, and adding the noise to one image. In the case illustrated in
Figures 10 and 11, the spectrum of the noise was designed to interfere maximally with the smallest channel.
In the case shown by HNOISE1 and HNOISE2 in Table 1, the noise was added such that the maximum
magnitude of the noise was equal to the maximum magnitude of the original image. HNOISE1 illustrates
the performance of the smallest channel. HNOISE2 illustrates the performance of the next larger channel. It
can be seen that for this case, some fusion is still possible in the smallest channel, although it is patchy. The
next larger channel also obtains fusion. In both cases, the accuracy of the disparity values is reduced from
the normal case. This is to be expected, since the introduction of noise tends to displace the positions of the
zero-crossings. In the case shown by HNOISE3 and HNOISE4 in Table 1, the noise was added such that the
maximum magnitude was twice that of the maximum magnitude of the original image. Here, matching in the
smallest channel is almost completely eliminated (HNOISE3). Yet matching in the next larger channel is only
marginally affected (HNOISE4).

Figure 10. The top stereo pair is a 50% density pattern in which the left image has been blurred. The disparity
map is shown below it. It can be seen that two planes are still evident, although they are not as sharply defined
as in Figure 7 or Figure 8. The disparity map is that obtained from the w = 4 channel. The bottom stereo
pair is a 50% density pattern. The left image has had high pass filtered noise added to it so that the maximum
magnitude of the noise is equal to the maximum magnitude of the image. The disparity map shown is that
obtained by the w = 9 channel.

Stereo Implementation

37

E. Grimson

Figure 11. The top stereo pair is a 50% density pattern. The left image has had high pass filtered noise added
to it so that the maximum magnitude of the noise is half the maximum magnitude of the image. The top
disparity map is that obtained from the w = 9 channel, while the next disparity map is that obtained from the
w = 4 channel. It can be seen that the w = 4 channel obtains a matching only in a few sections of the image.
The bottom stereo pair is a 50% density pattern in which the left image has been compressed in the horizontal
direction. The disparity map from the w = 4 channel is displayed below. It can be seen that the two planes are
still evident, although the entire pattern appears slanted. This is in agreement with human perception.

The implementation was also tested on the case of adding low pass filtered noise to a random dot pattern,
with results similar to those of adding high pass filtered noise. Here, the larger channels are unable to obtain a
good matching, while the smaller channels are relatively unaffected.

If one of the images of a random dot pattern is compressed in the horizontal direction, the human stereo
system is still able to achieve fusion (Julesz 1971, p.213). The algorithm was tested on this case, and the results
are shown in Figure 11. It can be seen that the program still obtains a reasonably good match. The planes are
now slightly slanted, which agrees with human perception.

If some of the dots of a pattern are decorrelated, it is still possible for a human observer to achieve
some kind of fusion (Julesz 1971, p.88). Two different types of decorrelation were tested. In the first type,
increasing percentages of the dots in the left image were decorrelated at random. In particular, the cases of
10%, 20% and 30% were tried, and are illustrated in Figure 12. For the 10% case (table entry Uncorr1),
it can be seen that the algorithm was still able to obtain a good matching of the two planes, although the
total number of zero-crossings assigned a disparity decreased, and the percentage of incorrectly matched
points increased. When the percentage of decorrelated dots was increased to 20% (table entry Uncorr2), the
number of matched points decreased again, although the percentage of those which were incorrectly matched
remained about the same. Finally, when the percentage of decorrelated dots was increased to 30% (table entry
Uncorr3), the algorithm found virtually no section of the image which could be fused.

Figure 12. The top stereo pair is a 50% density pattern in which the left image has had 10% of the dots
decorrelated. The disparity map is shown below. The bottom stereo pair is a 50% density pattern in which
the left image has had 20% of the dots decorrelated. The disparity map is shown below. Note that in this case
there are large regions of the image for which no match was made.

The failure of the algorithm to match the 30% decorrelated pattern is caused by the component of the
algorithm which checks that each region of the image is within range of correspondence. Recall that in order
to distinguish between the case of two images beyond range of fusion (for the current eye positions), which
will have only randomly matching zero-crossings, and the case of two images within range of fusion, the Marr-
Poggio theory requires that the percentage of unmatched points is less than some threshold. This threshold
is approximately 0.3, according to the statistical analysis of Marr and Poggio (1979). For the case of the
pattern with 30% decorrelation, on the average, each region of the image will have roughly 30% of its
zero-crossings different and hence the algorithm decides that the region is out of range of correspondence. Hence,
no disparities are accepted for this region.
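
The range check described above can be sketched as follows. This is an illustration only: the function
name, the use of None to mark an unmatched zero-crossing, and the treatment of the threshold as exactly
0.3 (the text gives it only approximately) are assumptions. Equating the fraction of decorrelated dots with
the fraction of unmatched zero-crossings is likewise a simplification.

```python
from typing import List, Optional


def region_in_range(matches: List[Optional[float]], threshold: float = 0.3) -> bool:
    """Decide whether a region is within range of correspondence.

    `matches` holds one entry per zero-crossing in the region: its disparity,
    or None if no match was found. The region is accepted only if the
    fraction of unmatched zero-crossings is strictly below `threshold`.
    """
    if not matches:
        return False
    unmatched = sum(1 for m in matches if m is None)
    return unmatched / len(matches) < threshold


def region_with(unmatched: int, total: int = 100) -> List[Optional[float]]:
    """Build a toy region with the given number of unmatched zero-crossings."""
    return [None] * unmatched + [0.0] * (total - unmatched)


decisions = {pct: region_in_range(region_with(pct)) for pct in (10, 20, 30)}
```

With these assumptions the 10% and 20% regions are accepted and the 30% region is rejected, so all of its
disparities would be discarded.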

For the algorithm, the computational reason for the failure to process patterns with 30% decorrelation
is that it could not distinguish a correctly matched region of such a pattern from a region which was out of
range of correspondence, but had a random set of matches for many of the points in the region. It is interesting
to note that many human subjects observe a similar behavior; that is, some kind of fusion for up to 20%
decorrelation, although the fusion becomes increasingly weaker, and virtually no fusion for patterns with 30%
decorrelation.

One can also decorrelate the pattern by breaking up all white triplets along one set of diagonals, and
all black triplets along the other set of diagonals (Julesz 1971, p.87). The table entry Uncorrd indicates the
matching statistics for this case. Again, it can be seen that the program still obtains a good match, as do human
observers. The performance of the algorithm is illustrated in Figure 13.


4. Statistics 

A number of parameters are important for the theory, which makes assumptions about them, and they
have been measured on random dot images. The worst cases occur for patterns with a density of 50%, and
for such patterns the worst case values encountered for the parameters have the values shown in Table 2. The
theoretical worst case bounds used by Marr and Poggio appear for comparison.

Figure 13. The top stereo pair is a 50% density pattern in which the left image has been diagonally
decorrelated. Along one set of diagonals, every triplet of white dots has been broken by the insertion of a black dot,
and along the other set of diagonals, every triplet of black dots has been broken by the insertion of a white dot.
The disparity map is shown below. The bottom stereo pair is a special case of Panum’s limit. The left image is
formed by superimposing two slightly displaced copies of the right image. The disparity map is shown below,
and consists of two superimposed planes.

TABLE OF STATISTICS

parameter                      expected worst    large channel    medium channel    small channel
                               case behavior     w = 35           w = 17            w = 9

average distance between       2 w               1.51 w           1.88 w            1.37 w
zero-crossings of same sign

probability of candidates      >.50              .77              .75               .69
in at most one pool

probability of candidates      <.45              .21              .25               .31
in two pools

probability of candidates      <.05              .02              .01               .01
in all three pools

given a candidate near zero,   >.9               .33              .85               .87
probability of no other
candidates

Table 2.

5. Comments and Discussion

Implementing a computational theory offers us the opportunity of testing its adequacy. In this case, I
have found that the performance of the implementation coincides well with that of human subjects over a
broad range of random dot test cases obtained from the literature, including defocussing of, compression of,
and the introduction of various kinds of masking noise to one image of a random dot stereo pair.

The process of implementing the theory also led to the following observations and refinements of the 
theory. 

(1) There are a number of questions concerning the form of the 2½-D sketch. The first critical question
concerns whether the sketch reflects the initial or the retinal images. In the first case, the coordinates of the
sketch would be directly related to the coordinates of the images of the entire scene. However, since disparity
information about the scene is extracted from several eye positions, in order to store this information into a
buffer with coordinate system connected to the image of the scene, explicit information about the positions of
the eyes is required. For the computer implementation, this is possible, but for a model of the human visual
system, it seems unlikely that such information is available to the stereo process. In the second case, no such
problem arises. Here, the coordinates of the sketch are directly related to the coordinates of the retinal images.
Such a system would be retinocentric, reflecting the current positions of the eyes. This seems to be the most
natural representation.

The second question concerns the use of a fovea. Different sections of the images are analyzed at different
resolutions, for a given position of the optical axes. An important consequence of this is that the amount of
buffer space required to store the disparity will vary widely in the visual field, being much greater for the fovea
than for the periphery. This also suggests the use of a retinocentric representation, because if one used a frame
that had already allowed for eye-movements, it would have to have foveal resolution everywhere. Not only
does such a buffer waste space, but it does not agree with our own experience as perceivers. If such a buffer
were used, we should be able to build up a perceptual impression of the world that was everywhere as detailed
as it is at the centre of the gaze, and this is clearly not the case.

The final point about the 2½-D sketch is that it is intended as an intermediate representation of the
current scene. It is important for such a representation to pass on its information to higher level processes as
quickly as possible. Thus, it probably cannot wait for a representation to be built up over several positions
of the eyes. Rather, it must be refreshed for each eye position. Thus, a refinement to the implementation, as
outlined above, would be to use a representation that is retinocentric, and which represents disparities with
decreasing resolution as eccentricity increases.

For the cases illustrated in this article, the 2½-D sketch was created by storing fine resolution disparity
values into a scene-centered representation. A second alternative is to store values from all channels into a
retinocentric representation, using disparity values from the smaller channels where available, and the coarser
disparities from the larger channels elsewhere. In this way, a disparity representation for a single fixation of the
eyes may be constructed, with disparity resolution varying across the retina. Such a method of creating the
2½-D sketch has been tested on the implementation, with good results.

(2) The neighbourhood over which a search for a matching zero-crossing is conducted is broken into
three pools. In the present implementation, the pools are used to deal with the ambiguous case of two
matching zero-crossings, while the disparity values associated with a match are represented to within an image
element. A second possibility is to use the pools not only to disambiguate multiple matches, but also to assign
a disparity to a match. Thus, a single disparity value, equal to the disparity value of the midpoint of the
pool, would be assigned for a matching zero-crossing lying anywhere within the pool. In this scheme, only
three possible disparities could be assigned to a zero-crossing: zero, corresponding to the middle pool, or ±w/2,
corresponding to the divergent or convergent pools.

Computer experiments show that either scheme will work. In the case of a single disparity value for each
pool, the disparities assigned by the smallest channel are within an image element of those obtained using
exact disparities for each match. This modification was tried on both natural images and random dot patterns,
and suggests that the accuracy with which the pools represent the match is not a critical factor.
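
The second scheme, in which each pool contributes a single disparity value, can be sketched as below. Several
details are assumptions made only for illustration: the search range is taken to be ±w, the middle pool is
given a hypothetical half-width of one image element, convergent disparities are taken as positive, and the
two outer pools are assigned ±w/2. Disambiguation of candidates in more than one pool, which the matcher
handles separately, is omitted, so such cases simply return None.

```python
from typing import Iterable, Optional


def pool_of(disparity: float, w: float, zero_half_width: float = 1.0) -> Optional[str]:
    """Name the pool a candidate falls in, or None if it lies outside +/- w."""
    if abs(disparity) > w:
        return None
    if abs(disparity) <= zero_half_width:
        return "zero"
    return "convergent" if disparity > 0 else "divergent"


def pooled_disparity(candidates: Iterable[float], w: float) -> Optional[float]:
    """Assign one disparity per pool; None when the match is absent or ambiguous."""
    pools = {pool_of(d, w) for d in candidates}
    pools.discard(None)
    if len(pools) != 1:
        return None
    return {"divergent": -w / 2, "zero": 0.0, "convergent": w / 2}[pools.pop()]


single = pooled_disparity([3.0], w=9)          # one candidate in the convergent pool
ambiguous = pooled_disparity([3.0, -3.0], w=9)  # candidates in two pools
```

Under this scheme every match in a channel is quantized to one of three values, which is why the coarse
result still has to be refined by the smaller channels.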

(3) Although the Marr-Poggio matcher is designed to match from one image into the other, there is no
inherent reason why the matching process cannot be driven from both eyes independently. In fact, there
may be some evidence that this is so, as is shown by the following experiment of O. Braddick (1978) on an
extension to Panum’s limiting case. First, a sparse random dot pattern was constructed. From this pattern, a
partner was created by displacing the entire pattern by slight amounts to both the left and the right. Thus, for
each dot in the right image, there corresponded two dots in the left image, one with a small displacement to
the left and one with a small displacement to the right. The perception obtained by viewing such a random dot
stereogram is one of two superimposed planes.

Suppose the matching process were only driven from one image, for example, matches were made from
the right image to the left. In this case, the implementation would not be able to account for the Braddick
perception, since all the zero-crossings would have two possible candidates. However, suppose that the
matching process were driven independently from both the right and left images, and an unambiguous match from
either side accepted. In this case, although every zero-crossing in the right image would have an ambiguous
match, the implementation would obtain a unique match for each zero-crossing in the left image. The
implementation was designed to account for matching from either image.
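
The argument can be checked on a toy one-dimensional version of Braddick's pattern. In the sketch below,
which is an illustration and not the implementation, dot positions along a single scan line stand in for
zero-crossings, signs and orientations are ignored, and the function name, dot positions, shift and search
range are all hypothetical.

```python
from typing import Dict, List


def match_one_way(source: List[int], target: List[int], search: int) -> Dict[int, int]:
    """Return {source position: disparity} for positions with exactly one candidate."""
    matches = {}
    for x in source:
        found = [t for t in target if abs(t - x) <= search]
        if len(found) == 1:              # accept unambiguous matches only
            matches[x] = found[0] - x
    return matches


right = [10, 40, 70]                     # sparse dots on one scan line
shift = 2
# The left image superimposes two copies of the right, displaced either way.
left = sorted([x - shift for x in right] + [x + shift for x in right])

from_right = match_one_way(right, left, search=4)   # two candidates each: ambiguous
from_left = match_one_way(left, right, search=4)    # one candidate each: unique
```

Matching from the right image yields nothing, while matching from the left image assigns every dot one of
two disparities of opposite sign, which corresponds to two superimposed planes.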

Braddick's case has been tested on the implementation, and the results are shown in Figure 13. It can be
seen that the results of the implementation are those of two transparent planes.

(4) The points that were incorrectly matched in the test cases all lay along depth discontinuities. The
major reason for this is connected with occlusion of regions. Note that at any depth discontinuity, there will
be an occluded region which is present in one image, but not the other. Any zero-crossings within that region
cannot, of course, have a matching zero-crossing in the other image. However, there is a certain probability
of such a zero-crossing being matched incorrectly to a random zero-crossing in the other image. In principle,
the algorithm detects regions which are occluded, by checking the statistics of the number of unmatched
zero-crossings, and using such results to mark all zero-crossing matches in the region as unknown. However, for a
region which contains a depth discontinuity, only part of the region will have the above characteristics.
Zero-crossings in the rest of the region will have a unique match. Thus, when the statistical check on the number
of unmatched points is performed, it is possible for the entire region to be considered in range, and thus all
matches, including the incorrect ones of the occluded region, will be accepted.

(5) It is interesting to comment on the effect of depth discontinuities for the different sized masks. For
random dot patterns, the zero-crossings obtained from the larger masks tend to outline blobs or clusters of
dots. Thus in general, the positions of the zero-crossings do not correspond to single elements of the
underlying image. Suppose the dot pattern consists of one plane separated in depth from a second plane. In such a
case, one might well find a zero-crossing that belongs at one end to dots on the first plane, and at the other end
to dots belonging to the second plane. Such zero-crossings will be assigned disparities that reflect, to within
the resolution of the channel, the structure of the image. The zero-crossings lying between the two ends will,
however, receive disparities that smoothly vary from one extreme to the other. The largest channel would thus
not see a plane separated in depth from a second plane, but rather a smooth hump.

For the smaller mask this does not occur, as the zero-crossing contours tend to outline individual dots or
connected groups of dots. Thus the disparities assigned are such that the dots belong to one plane or the other
and the final disparity map is one of two separated planes.

To achieve perfect results from stereo, it is probably necessary to include in the 2½-dimensional sketch
a way of dealing competently with discontinuities. Some initial work has already been done in this direction
(Grimson, in preparation). Interestingly, when one looks at a 5% random-dot stereogram portraying a square
in front of its background, one sees vivid subjective contours at its boundary, although the output of the
matcher does not account for this.

(6) One consequence of the Marr-Poggio theory is that explicit disparity values will be obtained only
along the zero-crossing contours. It may be desirable to create a more complete reconstruction of the shapes of
the objects in the scene, by filling in disparity values between the zero-crossing contours. Some work has been
done in this direction (Grimson, in preparation) and an example is shown in Figure 14.

Figure 14. Example of filling in the disparity map. The top left figure is the initial image. The top right figure
shows the disparity map associated with the image, where the disparity is represented by the intensity of the
point. The bottom figures show the filled in map, again using intensity to represent disparity. In the left figure,
the full range of disparity is shown, indicating the slant of the background plane, and the extreme difference in
disparity between the jar and the background. In the right figure, the intensities have been adjusted to enhance
the disparities of the jar, indicating the general shape of the interpolated surface.

(7) An integral part of most computational theories, proposed as models of aspects of the human visual
system, is the use of computational constraints based on assumptions about the physical world (Marr and
Poggio, 1979; Marr and Hildreth, 1980; Ullman, 1979). The constraints so derived are critical in the formation
of the computational theory, and in the design of an algorithm for solving the problem. An interesting
question to raise is whether the algorithm explicitly checks that the constraints imposed by the theory are satisfied.
For example, Ullman’s rigidity constraint in the analysis of structure from motion is explicitly checked by his
algorithm. For the case of the Marr-Poggio stereo theory, two constraints were outlined, uniqueness and
continuity of disparity values. It is curious that in the algorithm used to solve the stereo problem, the continuity
constraint is explicitly checked while the uniqueness constraint is not. Uniqueness of disparity is required in
one direction of matching, since only those zero-crossing segments of one image which have exactly one match
in the second image are accepted. However, it may be the case that more than one element of the right image
could be matched to an element of the left image, for matching in this direction. When matching from the
right image to the left, the same is true. Note that one could easily alter the algorithm to include the checking
of uniqueness, thereby retaining only those disparity values corresponding to zero-crossing segments with a
unique disparity value when matched from both images. However, the evidence of Braddick discussed above
would indicate that this is not the case. Hence, in the Marr-Poggio stereo theory, although both the
requirements of uniqueness and continuity are subsumed, only one of these two constraints is explicitly checked by the
algorithm.

(8) It is worth observing the distinction between the performance of the implementation on random
dot patterns and the performance of the implementation on natural images. Some examples are shown in
Figure 15. The main point is that on the whole, the performance is quite acceptable for random dot patterns.
However, the implementation can occasionally fail in the case of natural images. The question is whether this
reflects a basic inadequacy in the theory and its implementation, or whether there are other aspects of the
visual process interacting with stereo which have not been included in this implementation.

Figure 15. Examples of natural images. The top stereo pair is a scene of a basketball game. The disparity map
below is viewed from the side, so that the width of the black bars indicates the relative disparity. The bottom
stereo pair is of a sculpture by Henry Moore. The disparity maps below it are also viewed from the side. The
left map illustrates the extreme range of disparity between the trees in the background and the sculpture itself.
The right map has been adjusted to enhance the disparities of the sculpture, indicating its form.

This can be approached in two ways: (1) Is the assumption of modularity incorrect? In other words, is
there something wrong with the matching module as developed by Marr and Poggio, and as implemented
here? (2) Are there other modules, not considered here, which may affect the input or the output of the
matching module?

The results of testing the implementation on die broad range of images, indicated in previous sections, 
seems to indicate that the matching module is acceptable as an independent one. In particular, the agreement! 
between the performance of die algorithm and that of human observers on the many random dot patterns 
seems to indicate that the matching module is acceptable, since in these cases, all other visual cues have been 
isolated from the matcher. 

When we turn to natural images, it is reasonable to expect that other visual modules may affect the input 
to die matcher and that they may alter the output of die matcher. This is not to suggest diat the matcher is 
incotrect, only that the effects of odier modules must be taken into account in order to explain the complete 
human perception. For example, die evidence of Kidd, Frisby and Mayhcw (1979) concerning die ability of 
texture boundaries to drive eye vergencc movements indicates that other visual information besides.disparity 
may alter the position of die eyes, and thus the input to die matcher. However, it does not necessarily imply 
that die matcher itself needs to be modified. 

Interestingly, the performance of the implementation supports tliis point. The implementation, which is 
considered a disdnet module, also performs very well on random dot patterns, where dierc is no possibility 
of interaction with other visual processes. For many natural images, this is still true. However, occasionally 
it is the case that a natural image provides some difficulty for die implementation. A particular example of 
this occurs in the image of Figure 16 . Here, the regular pattern of die windows provides a strong false targets 
problem. In running die implementation, the following behavior was observed. If the optical axes were aligned 
at the level of die building, the zero-crossings corresponding to die windows were all assigned a correct dis¬ 
parity. If, however, the optical axes were aligned at die level of die trees in front of die building, the windows 
were assigned an incorrect disparity, due to die regular pattern of zero-crossings associated with diem. Clearly, 
this seems wrong. Yet is die implementation wrong? Curiously, if one fuses the zero-crossing descriptions of 
the convolved images without eye movements, human observers have the same problem: if die eyes arc fixated 
at die level of the building, the windows arc correctly matched; if die eyes arc fixated at die level of the trees, 
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F-igurc 16. The false largest problem. The top figures arc a stereo pair of a group of buildings. The bottom 
figures show the zero-crossing descriptions of these images. The regular pattern of the windows of the rear 
building causes difficulties for the matcher, if the alignment of the eyes corresponds to fixating at the level 
of tltc building, the algorithm matches the zero-crossings corresponding to the windows correctly. If the 
alignment of the eyes corresponds to fixating at the level of the trees in front, of the building, the algorithm 

matches the zero-crossings corresponding to the windows incorrectly, [experiments indicate that under similar 
conditions humans have a similar perception. 
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the windows arc incorrectly matched. I wonld argue that this implies that the implementation, and hence the 
theory of the matching process is in fact correct. Given a particular set of aero-crossings, the module finds 
any acceptable matching and writes it into the SJ-D sketch. However, it is probably the ease that some later 
processing module, which examines the contents of the 2 J-D sketch, is capable of altering the contents stored 
there, based on more global information than is available to the matching component of the stereo process. 

Thus, I would suggest that future refinements to the Marr-Poggio theory must account for the interac- 
uons of other aspects of visual information processing on the input and output of the matching module. Some 
initial work has already been done in this direction (Grimson, in preparation). 
To distinguish between the case of two images beyond range of fusion (for the current eye positions), which
will have only randomly matching zero-crossings, and the case of two images within range of fusion, the
Marr-Poggio theory requires that the fraction of unmatched points be less than some threshold. This threshold
is approximately 0.3, according to the statistical analysis of Marr and Poggio (1979). For the case of the
pattern with 30% decorrelation, on the average, each region of the image will have roughly 30% of its
zero-crossings different, and hence the algorithm decides that the region is out of range of correspondence. Hence,
no disparities are accepted for this region.
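The acceptance test just described can be sketched as follows. This is a minimal illustration only: the function name `region_in_range` and the representation of a region as a list of match results (with `None` for an unmatched zero-crossing) are assumptions made for the example, and only the threshold of roughly 0.3 comes from the text.

```python
from typing import Optional, Sequence

UNMATCHED_THRESHOLD = 0.3  # approximate value quoted from Marr and Poggio (1979)


def region_in_range(matches: Sequence[Optional[int]],
                    threshold: float = UNMATCHED_THRESHOLD) -> bool:
    """Return True if the fraction of unmatched zero-crossings is below threshold."""
    if not matches:
        return False  # assumption: an empty region gives no evidence of fusion
    unmatched = sum(1 for m in matches if m is None)
    return unmatched / len(matches) < threshold


# 10% of the zero-crossings unmatched: the region is treated as in range.
mostly_matched = [1, 1, 2, 1, None, 1, 1, 2, 1, 1]
# 30% unmatched, as for the 30% decorrelated pattern: the region is rejected.
decorrelated = [1, None, 2, 1, None, 1, 1, None, 1, 1]

print(region_in_range(mostly_matched))  # True
print(region_in_range(decorrelated))    # False
```

Under these assumptions, a region that reaches the threshold contributes no disparities at all, which matches the behavior reported for the 30% decorrelated pattern.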

For the algorithm, the computational reason for the failure to process patterns with 30% decorrelation
is that it could not distinguish a correctly matched region of such a pattern from a region which was out of
range of correspondence, but had a random set of matches for many of the points in the region. It is interesting
to note that many human subjects observe a similar behavior; that is, some kind of fusion for up to 20%
decorrelation, although the fusion becomes increasingly weaker, and virtually no fusion for patterns with 30%
decorrelation.

One can also decorrelate the pattern by breaking up all white triplets along one set of diagonals, and 
all black triplets along the other set of diagonals (Julesz 1971, p.87). The table entry Uncorrd indicates the 
matching statistics for this case. Again, it can be seen that the program still obtains a good match, as do human
observers. The performance of the algorithm is illustrated in Figure 13. 


4. Statistics 

A number of parameters are important for the theory, which makes assumptions about them, and they
have been measured on random dot images. The worst cases occur for patterns with a density of 50%, and
for such patterns the worst case values encountered for the parameters are shown in Table 2. The
theoretical worst case bounds used by Marr and Poggio appear for comparison.

Figure 13. The top stereo pair is a 50% density pattern in which the left image has been diagonally
decorrelated. Along one set of diagonals, every triplet of white dots has been broken by the insertion of a black dot,
and along the other set of diagonals, every triplet of black dots has been broken by the insertion of a white dot.
The disparity map is shown below. The bottom stereo pair is a special case of Panum's limit. The left image is
formed by superimposing two slightly displaced copies of the right image. The disparity map is shown below,
and consists of two superimposed planes.

TABLE OF STATISTICS

                                  expected      large       medium      small
                                  worst case    channel     channel     channel
parameter                         behavior      w = 35      w = 17      w = 9

average distance between
zero-crossings of same sign       2 w           1.51 w      1.88 w      1.37 w

probability of candidates
in at most one pool               >.50          .77         .75         .69

probability of candidates
in two pools                      <.45          .21         .25         .31

probability of candidates
in all three pools                <.05          .02         .01         .01

given a candidate near zero,
probability of no other
candidates                        >.9           .33         .85         .87

Table 2.


5. Comments and Discussion

Implementing a computational theory offers us the opportunity of testing its adequacy. In this case, I
have found that the performance of the implementation coincides well with that of human subjects over a
broad range of random dot test cases obtained from the literature, including defocussing of, compression of,
and the introduction of various kinds of masking noise to one image of a random dot stereo pair.

The process of implementing the theory also led to the following observations and refinements of the 
theory. 

(1) There are a number of questions concerning the form of the 2½-D sketch. The first critical question
concerns whether the sketch reflects the initial or the retinal images. In the first case, the coordinates of the
sketch would be directly related to the coordinates of the images of the entire scene. However, since disparity
information about the scene is extracted from several eye positions, in order to store this information into a
buffer with coordinate system connected to the image of the scene, explicit information about the positions of
the eyes is required. For the computer implementation, this is possible, but for a model of the human visual
system, it seems unlikely that such information is available to the stereo process. In the second case, no such
problem arises. Here, the coordinates of the sketch are directly related to the coordinates of the retinal images.
Such a system would be retinocentric, reflecting the current positions of the eyes. This seems to be the most
natural representation.

The second question concerns the use of a fovea. Different sections of the images are analyzed at different
resolutions, for a given position of the optical axes. An important consequence of this is that the amount of
buffer space required to store the disparity will vary widely in the visual field, being much greater for the fovea
than for the periphery. This also suggests the use of a retinocentric representation, because if one used a frame
that had already allowed for eye-movements, it would have to have foveal resolution everywhere. Not only
does such a buffer waste space, but it does not agree with our own experience as perceivers. If such a buffer
were used, we should be able to build up a perceptual impression of the world that was everywhere as detailed
as it is at the centre of the gaze, and this is clearly not the case.

The final point about the 2½-D sketch is that it is intended as an intermediate representation of the
current scene. It is important for such a representation to pass on its information to higher level processes as
quickly as possible. Thus, it probably cannot wait for a representation to be built up over several positions
of the eyes. Rather, it must be refreshed for each eye position. Thus, a refinement to the implementation, as
outlined above, would be to use a representation that is retinocentric, and which represents disparities with
decreasing resolution as eccentricity increases.

For the cases illustrated in this article, the 2½-D sketch was created by storing fine resolution disparity
values into a scene-centered representation. A second alternative is to store values from all channels into a
retinocentric representation, using disparity values from the smaller channels where available, and the coarser
disparities from the larger channels elsewhere. In this way, a disparity representation for a single fixation of the
eyes may be constructed, with disparity resolution varying across the retina. Such a method of creating the
2½-D sketch has been tested on the implementation, with good results.
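The second alternative can be illustrated with a short sketch. It assumes, purely for the example, that each channel delivers one row of disparities sampled at the same positions, with `None` where that channel has no value; the function name `combine_channels` and this data layout are hypothetical and are not taken from the implementation.

```python
from typing import List, Optional

Disparity = Optional[float]  # None marks "no value from this channel here"


def combine_channels(channels_fine_to_coarse: List[List[Disparity]]) -> List[Disparity]:
    """At each position, keep the value from the finest channel that has one."""
    length = len(channels_fine_to_coarse[0])
    combined: List[Disparity] = []
    for i in range(length):
        value: Disparity = None
        for channel in channels_fine_to_coarse:
            if channel[i] is not None:
                value = channel[i]
                break  # finer channels take priority over coarser ones
        combined.append(value)
    return combined


small = [None, 1.0, 1.0, None, None, None]
medium = [None, 1.5, None, 2.0, None, None]
large = [3.0, None, None, 2.5, 4.0, None]

print(combine_channels([small, medium, large]))
# [3.0, 1.0, 1.0, 2.0, 4.0, None]
```

Positions covered only by the large channel keep its coarse estimate, so the resolution of the combined row varies across the field in the way the text describes.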

(2) The neighbourhood over which a search for a matching zero-crossing is conducted is broken into
three pools. In the present implementation, the pools are used to deal with the ambiguous case of two
matching zero-crossings, while the disparity values associated with a match are represented to within an image
element. A second possibility is to use the pools not only to disambiguate multiple matches, but also to assign
a disparity to a match. Thus, a single disparity value, equal to the disparity value of the midpoint of the
pool, would be assigned for a matching zero-crossing lying anywhere within the pool. In this scheme, only
three possible disparities could be assigned to a zero-crossing: zero, corresponding to the middle pool, or ±w/2,
corresponding to the divergent or convergent pools.

Computer experiments show that either scheme will work. In the case of a single disparity value for each 
pool, the disparities assigned by the smallest channel are within an image element of those obtained using
exact disparities for each match. This modification was tried on both natural images and random dot patterns, 
and suggests that the accuracy with which the pools represent the match is not a critical factor. 
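The pooled scheme can be sketched as below. Several details are assumptions made only for this illustration: the search neighbourhood is taken to be ±w, the middle pool is taken to be a narrow band of half-width `zero_half_width` around zero, positive disparities are called convergent, and the pool midpoints are taken to be 0 and ±w/2. The function names are hypothetical.

```python
def assign_pool(disparity: float, w: float, zero_half_width: float = 1.0) -> str:
    """Classify a candidate match into one of three pools (assumed boundaries)."""
    if abs(disparity) > w:
        raise ValueError("disparity lies outside the search neighbourhood")
    if abs(disparity) <= zero_half_width:
        return "zero"
    # Sign convention is an assumption of this sketch.
    return "convergent" if disparity > 0 else "divergent"


def pooled_disparity(disparity: float, w: float, zero_half_width: float = 1.0) -> float:
    """Replace the exact disparity by the midpoint of the pool it falls in."""
    midpoints = {"zero": 0.0, "convergent": w / 2, "divergent": -w / 2}
    return midpoints[assign_pool(disparity, w, zero_half_width)]


w = 9  # width used for the small channel in the text
print(pooled_disparity(0.5, w))   # 0.0
print(pooled_disparity(4.0, w))   # 4.5
print(pooled_disparity(-6.0, w))  # -4.5
```

With this representation each channel reports one of only three values per match, so any finer disparity resolution has to come from combining channels of different sizes.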

(3) Although the Marr-Poggio matcher is designed to match from one image into the other, there is no
inherent reason why the matching process cannot be driven from both eyes independently. In fact, there
may be some evidence that this is so, as is shown by the following experiment of O. Braddick (1978) on an
extension to Panum's limiting case. First, a sparse random dot pattern was constructed. From this pattern, a
partner was created by displacing the entire pattern by slight amounts to both the left and the right. Thus, for
each dot in the right image, there corresponded two dots in the left image, one with a small displacement to
the left and one with a small displacement to the right. The perception obtained by viewing such a random dot
stereogram is one of two superimposed planes.

Suppose the matching process were only driven from one image, for example, matches were made from
the right image to the left. In this case, the implementation would not be able to account for the Braddick
perception, since all the zero-crossings would have two possible candidates. However, suppose that the
matching process were driven independently from both the right and left images, and an unambiguous match from
either side accepted. In this case, although every zero-crossing in the right image would have an ambiguous
match, the implementation would obtain a unique match for each zero-crossing in the left image. The
implementation was designed to account for matching from either image.

Braddick's case has been tested on the implementation, and the results are shown in Figure 13. It can be
seen that the result of the implementation is that of two transparent planes.
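The argument of item (3) can be traced on a toy one-dimensional version of Braddick's pattern. This is a simplified sketch, not the implementation: dots on a single scan line stand in for zero-crossings, sign and orientation are ignored, and the helper names, positions, shift and search range are all chosen for illustration.

```python
from typing import Dict, List


def candidates(x: int, other: List[int], search: int) -> List[int]:
    """Positions in the other image lying within +/- search of x."""
    return [y for y in other if abs(y - x) <= search]


def match_from(source: List[int], target: List[int], search: int) -> Dict[int, int]:
    """Keep only source positions that have exactly one candidate in target."""
    result: Dict[int, int] = {}
    for x in source:
        found = candidates(x, target, search)
        if len(found) == 1:
            result[x] = found[0]
    return result


right = [10, 30, 50]  # sparse pattern (hypothetical positions)
shift = 2
# Left image: the same pattern displaced both to the left and to the right.
left = sorted([x - shift for x in right] + [x + shift for x in right])
search = 3

from_right = match_from(right, left, search)  # every right dot has two candidates
from_left = match_from(left, right, search)   # every left dot has exactly one
disparities = sorted({l - r for l, r in from_left.items()})

print(from_right)   # {}
print(disparities)  # [-2, 2]
```

Matching from the right image alone yields nothing, while matching from the left image yields two disparity values, one per plane, which is the two-plane outcome described in the text.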

(4) The points that were incorrectly matched in the test cases all lay along depth discontinuities. The
major reason for this is connected with occlusion of regions. Note that at any depth discontinuity, there will
be an occluded region which is present in one image, but not the other. Any zero-crossings within that region
cannot, of course, have a matching zero-crossing in the other image. However, there is a certain probability
of such a zero-crossing being matched incorrectly to a random zero-crossing in the other image. In principle,
the algorithm detects regions which are occluded, by checking the statistics of the number of unmatched
zero-crossings, and using such results to mark all zero-crossing matches in the region as unknown. However, for a
region which contains a depth discontinuity, only part of the region will have the above characteristics.
Zero-crossings in the rest of the region will have a unique match. Thus, when the statistical check on the number
of unmatched points is performed, it is possible for the entire region to be considered in range, and thus all
matches, including the incorrect ones of the occluded region, will be accepted.
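A small worked example shows how this failure arises. All counts below are hypothetical and chosen only to make the arithmetic concrete; the threshold of about 0.3 is the value quoted earlier from Marr and Poggio (1979).

```python
THRESHOLD = 0.3  # approximate unmatched-fraction threshold quoted in the text

total = 100          # zero-crossings in the region (hypothetical)
occluded = 20        # of these, lying in the strip seen by one eye only (hypothetical)
falsely_matched = 8  # occluded zero-crossings that found a chance match (hypothetical)

# Only the occluded zero-crossings that found no partner count as unmatched;
# the remaining 80 zero-crossings are assumed to match uniquely.
unmatched = occluded - falsely_matched
unmatched_fraction = unmatched / total
accepted = unmatched_fraction < THRESHOLD

print(unmatched_fraction, accepted)  # 0.12 True
```

Because the region as a whole stays well under the threshold, it is accepted as in range, and the eight chance matches inside the occluded strip are accepted along with the correct ones.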

(5) It is interesting to comment on the effect of depth discontinuities for the different sized masks. For
random dot patterns, the zero-crossings obtained from the larger masks tend to outline blobs or clusters of
dots. Thus in general, the positions of the zero-crossings do not correspond to single elements of the underlying
image. Suppose the dot pattern consists of one plane separated in depth from a second plane. In such a
case, one might well find a zero-crossing that belongs at one end to dots on the first plane, and at the other end
to dots belonging to the second plane. Such zero-crossings will be assigned disparities that reflect, to within
the resolution of the channel, the structure of the image. The zero-crossings lying between the two ends will,
however, receive disparities that smoothly vary from one extreme to the other. The largest channel would thus
not see a plane separated in depth from a second plane, but rather a smooth hump.

For the smaller mask this does not occur, as the zero-crossing contours tend to outline individual dots or
connected groups of dots. Thus the disparities assigned are such that the dots belong to one plane or the other
and the final disparity map is one of two separated planes.

To achieve perfect results from stereo, it is probably necessary to include in the 2½-dimensional sketch
a way of dealing competently with discontinuities. Some initial work has already been done in this direction
(Grimson, in preparation). Interestingly, when one looks at a 5% random-dot stereogram portraying a square
in front of its background, one sees vivid subjective contours at its boundary, although the output of the
matcher does not account for this.

(6) One consequence of the Marr-Poggio theory is that explicit disparity values will be obtained only
along the zero-crossing contours. It may be desirable to create a more complete reconstruction of the shapes of
the objects in the scene, by filling in disparity values between the zero-crossing contours. Some work has been
done in this direction (Grimson, in preparation) and an example is shown in Figure 14.
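To make the idea of filling in concrete, the sketch below interpolates linearly along one scan line between positions where a disparity is known. Linear interpolation is used here only as the simplest possible stand-in; it is an assumption of this example and is not claimed to be the method of the work cited above. The function name `fill_in_row` is hypothetical.

```python
from typing import List, Optional


def fill_in_row(row: List[Optional[float]]) -> List[Optional[float]]:
    """Linearly interpolate between known values; positions outside the
    first and last known values are left as None."""
    filled = list(row)
    known = [i for i, v in enumerate(row) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            filled[i] = row[a] + t * (row[b] - row[a])
    return filled


# Disparities are known only where zero-crossing contours cross the scan line.
row = [None, 2.0, None, None, None, 4.0, None, 3.0]
print(fill_in_row(row))
# [None, 2.0, 2.5, 3.0, 3.5, 4.0, 3.5, 3.0]
```

A scheme of this kind smooths across every gap, including gaps that span a depth discontinuity, so it would need to be combined with some treatment of discontinuities such as that raised in item (5).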

Figure 14. Example of filling in the disparity map. The top left figure is the initial image. The top right figure
shows the disparity map associated with the image, where the disparity is represented by the intensity of the
point. The bottom figures show the filled in map, again using intensity to represent disparity. In the left figure,
the full range of disparity is shown, indicating the slant of the background plane, and the extreme difference in
disparity between the jar and the background. In the right figure, the intensities have been adjusted to enhance
the disparities of the jar, indicating the general shape of the interpolated surface.

(7) An integral part of most computational theories, proposed as models of aspects of the human visual
system, is the use of computational constraints based on assumptions about the physical world (Marr and
Poggio, 1979, Marr and Hildreth, 1980, Ullman, 1979). The constraints so derived are critical in the formation
of the computational theory, and in the design of an algorithm for solving the problem. An interesting
question to raise is whether the algorithm explicitly checks that the constraints imposed by the theory are satisfied.
For example, Ullman's rigidity constraint in the analysis of structure from motion is explicitly checked by his
algorithm. For the case of the Marr-Poggio stereo theory, two constraints were outlined, uniqueness and
continuity of disparity values. It is curious that in the algorithm used to solve the stereo problem, the continuity
constraint is explicitly checked while the uniqueness constraint is not. Uniqueness of disparity is required in
one direction of matching, since only those zero-crossing segments of one image which have exactly one match
in the second image are accepted. However, it may be the case that more than one element of the right image
could be matched to an element of the left image, for matching in this direction. When matching from the
right image to the left, the same is true. Note that one could easily alter the algorithm to include the checking
of uniqueness, thereby retaining only those disparity values corresponding to zero-crossing segments with a
unique disparity value when matched from both images. However, the evidence of Braddick discussed above
would indicate that this is not the case. Hence, in the Marr-Poggio stereo theory, although the requirements
of both uniqueness and continuity are subsumed, only one of these two constraints is explicitly checked by the
algorithm.
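The stricter two-way uniqueness check mentioned in item (7) can be sketched as follows. As before, this is a toy one-dimensional illustration with hypothetical names and positions, in which dots on a scan line stand in for zero-crossings; it is not the implementation's matcher.

```python
from typing import Dict, List, Tuple


def unique_matches(source: List[int], target: List[int], search: int) -> Dict[int, int]:
    """Map each source position that has exactly one candidate to that candidate."""
    out: Dict[int, int] = {}
    for x in source:
        found = [y for y in target if abs(y - x) <= search]
        if len(found) == 1:
            out[x] = found[0]
    return out


def two_way_unique(left: List[int], right: List[int], search: int) -> List[Tuple[int, int]]:
    """Keep a pair only if it is the unique match in both matching directions."""
    left_to_right = unique_matches(left, right, search)
    right_to_left = unique_matches(right, left, search)
    return [(l, r) for l, r in left_to_right.items() if right_to_left.get(r) == l]


right = [10, 30, 50]
simple_left = [12, 32, 52]                # a single plane at disparity +2
braddick_left = [8, 12, 28, 32, 48, 52]   # the pattern displaced both ways

print(two_way_unique(simple_left, right, 3))    # [(12, 10), (32, 30), (52, 50)]
print(two_way_unique(braddick_left, right, 3))  # []
```

On the single-plane pattern the strict check changes nothing, but on the doubly displaced pattern it rejects every match, which is why Braddick's observation argues against imposing uniqueness in both directions.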

(8) It is worth observing the distinction between the performance of the implementation on random
dot patterns and the performance of the implementation on natural images. Some examples are shown in
Figure 15. The main point is that on the whole, the performance is quite acceptable for random dot patterns.
However, the implementation can occasionally fail in the case of natural images. The question is whether this
reflects a basic inadequacy in the theory and its implementation, or whether there are other aspects of the
visual process interacting with stereo which have not been included in this implementation.

Figure 15. Examples of natural images. The top stereo pair is a scene of a basketball game. The disparity map
below is viewed from the side, so that the width of the black bars indicates the relative disparity. The bottom
stereo pair is of a sculpture by Henry Moore. The disparity maps below it are also viewed from the side. The
left map illustrates the extreme range of disparity between the trees in the background and the sculpture itself.
The right map has been adjusted to enhance the disparities of the sculpture, indicating its form.

This can be approached in two ways: (1) Is the assumption of modularity incorrect? In other words, is
there something wrong with the matching module as developed by Marr and Poggio, and as implemented
here? (2) Are there other modules, not considered here, which may affect the input or the output of the
matching module?

The results of testing the implementation on the broad range of images, indicated in previous sections,
seem to indicate that the matching module is acceptable as an independent one. In particular, the agreement
between the performance of the algorithm and that of human observers on the many random dot patterns
seems to indicate that the matching module is acceptable, since in these cases, all other visual cues have been
isolated from the matcher.

When we turn to natural images, it is reasonable to expect that other visual modules may affect the input
to the matcher and that they may alter the output of the matcher. This is not to suggest that the matcher is
incorrect, only that the effects of other modules must be taken into account in order to explain the complete
human perception. For example, the evidence of Kidd, Frisby and Mayhew (1979) concerning the ability of
texture boundaries to drive eye vergence movements indicates that other visual information besides disparity
may alter the position of the eyes, and thus the input to the matcher. However, it does not necessarily imply
that the matcher itself needs to be modified.

Interestingly, the performance of the implementation supports this point. The implementation, which is
considered a distinct module, also performs very well on random dot patterns, where there is no possibility
of interaction with other visual processes. For many natural images, this is still true. However, occasionally
it is the case that a natural image provides some difficulty for the implementation. A particular example of
this occurs in the image of Figure 16. Here, the regular pattern of the windows provides a strong false targets
problem. In running the implementation, the following behavior was observed. If the optical axes were aligned
at the level of the building, the zero-crossings corresponding to the windows were all assigned a correct
disparity. If, however, the optical axes were aligned at the level of the trees in front of the building, the windows
were assigned an incorrect disparity, due to the regular pattern of zero-crossings associated with them. Clearly,
this seems wrong. Yet is the implementation wrong? Curiously, if one fuses the zero-crossing descriptions of
the convolved images without eye movements, human observers have the same problem: if the eyes are fixated
at the level of the building, the windows are correctly matched; if the eyes are fixated at the level of the trees,
the windows are incorrectly matched. I would argue that this implies that the implementation, and hence the
theory of the matching process, is in fact correct. Given a particular set of zero-crossings, the module finds
any acceptable matching and writes it into the 2½-D sketch. However, it is probably the case that some later
processing module, which examines the contents of the 2½-D sketch, is capable of altering the contents stored
there, based on more global information than is available to the matching component of the stereo process.

Figure 16. The false targets problem. The top figures are a stereo pair of a group of buildings. The bottom
figures show the zero-crossing descriptions of these images. The regular pattern of the windows of the rear
building causes difficulties for the matcher. If the alignment of the eyes corresponds to fixating at the level
of the building, the algorithm matches the zero-crossings corresponding to the windows correctly. If the
alignment of the eyes corresponds to fixating at the level of the trees in front of the building, the algorithm
matches the zero-crossings corresponding to the windows incorrectly. Experiments indicate that under similar
conditions humans have a similar perception.

Thus, I would suggest that future refinements to the Marr-Poggio theory must account for the
interactions of other aspects of visual information processing on the input and output of the matching module. Some
initial work has already been done in this direction (Grimson, in preparation).